ABSTRACT



Ankyrin (ANK) repeats are one of the most common amino acid sequence motifs
that mediate interactions between proteins of myriad sizes, shapes and functions.
We assess their widespread abundance in Bacteria and Archaea for the first time and 
demonstrate in Bacteria that lifestyle, rather than phylogenetic history, is a predictor 
of ANK repeat abundance. Unrelated organisms that forge facultative and obligate 
symbioses with eukaryotes show enrichment for ANK repeats in comparison to
free-living bacteria. The reduced genomes of obligate intracellular bacteria remarkably
contain a higher fraction of ANK repeat proteins than other lifestyles, and the
number of ANK repeats in each protein is augmented in comparison to other bacteria.
Taken together, these results reevaluate the concept that ANK repeats are signature 
features of eukaryotic proteins and support the hypothesis that intracellular bacteria 
broadly employ ANK repeats for structure-function relationships with the
eukaryotic host cell.
INTRODUCTION

Ankyrin (ANK) repeats are ubiquitous structural motifs in eukaryotic proteins. They
function as scaffolds to facilitate protein-protein interactions involved in signal
transduction, cell cycle regulation, vesicular trafficking, inflammatory response,
cytoskeleton integrity, and transcriptional regulation, among others (Mosavi et al., 2004).
Consistent with the necessity of their function, amino acid substitutions in the ANK
repeats of a protein (ANK-containing proteins) are associated with a number of human
diseases including cancer (p16 protein) (Tang, Fersht & Itzhaki, 2003), neurological
disorders such as CADASIL (Notch protein) (Joutel et al., 1996), and skeletal dysplasias
(TRPV4 protein) (Inada et al., 2012; Mosavi et al., 2004). In addition, variations in the
amino acid sequence of the human ANKK1 are associated with addictive behaviors such
as alcoholism and nicotine addiction (Ponce et al., 2008; Suraj Singh, Ghosh & Saraswathy,
2013).

The structure of each individual 33 amino acid ANK repeat begins with a β-turn that
precedes two antiparallel α-helices and ends with a loop that feeds into the next repeat.
These interconnected protein motifs stack one upon another to form an ANK domain
(Gorina & Pavletich, 1996; Sedgwick & Smerdon, 1999). The prevalence and varied
functionality of ANK-containing proteins in eukaryotes can be attributed to (i) the strong
degeneracy of the 33 amino acid repeat that allows for the specificity of individual
molecular interactions, and (ii) the variability in the number of individual repeats in an
ANK domain, which provides a platform for protein interactions (Li, Mahajan & Tsai,
2006; Sedgwick & Smerdon, 1999).

Because the ANK repeat was discovered in Saccharomyces cerevisiae,
Schizosaccharomyces pombe, and Drosophila melanogaster (Breeden & Nasmyth, 1987),
it was quickly regarded as a signature feature of eukaryotic proteins. Despite the
conventional wisdom (until recently) and frequent citations in the literature that ANK 
repeats are taxonomically restricted to eukaryotes, there has been no systematic 
investigation to assess their distribution across the diversity of life. Several related 
questions on the comparative biology of ANK repeats can be addressed: Are 
ANK-containing proteins more prevalent in the domains Eukarya than Bacteria and 
Archaea and to what extent? What is the typical fraction of a proteome dedicated to 
proteins containing ANK repeats across the three domains of life? Are ANK-containing 
proteins distributed non-randomly with respect to taxonomy or lifestyle? 

In this study, we establish a new threshold on the distribution of ANK repeats across 
the tree of life. Further, the enrichment of ANK-containing proteins in symbiotic bacteria 
provides comprehensive support to experimental cases in which ANK-containing 
proteins promote interactions between bacterial and eukaryotic cells. 

MATERIALS AND METHODS 

ANK data acquisition and analysis 

All genome information was obtained from the SUPERFAMILY v1.75 database
(SUPERFAMILY; Wilson et al., 2009), including the taxonomy and number of
ANK-containing proteins (Table S1). The SUPERFAMILY database currently contains
protein domain information on 2,489 strains, where there can be more than one strain
representing a single phylogenetic species. This database is an archive of structural and
functional domains in proteins of sequenced genomes (Wilson et al., 2009), which are
annotated using hidden Markov models through the SCOP (Structural Classification of
Proteins) SUPERFAMILY protein domain classification (Gough et al., 2001;
SUPERFAMILY). We note, with appropriate caution, that ANK-containing proteins are
identified based on a computational framework and are not experimentally confirmed.
We used NCBI's Genome resource (NCBI Genome resource) to obtain total gene and
protein numbers for each strain in the analysis. To determine the percent of a strain's total
protein number (proteome) that is composed of ANK-containing proteins, the number of
ANK-containing proteins was divided by the total number of proteins and multiplied by a
factor of 100. Only strains with available total protein information were used in this
analysis. For the bacterial class and lifestyle analyses, the number and/or percent of
ANK-containing proteins was averaged over all strains of the same species. For the
lifestyle analysis, ANK-containing protein information on Cardinium
hertigii was added because detailed information regarding its
ANK-containing proteins was recently published (Penz et al., 2012).
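The normalization and species-level averaging described above can be sketched as follows. This is a minimal illustration only: the function names and the example records are hypothetical and are not taken from Table S1 or from the scripts used in the study.

```python
from collections import defaultdict
from statistics import mean


def percent_ank(ank_proteins: int, total_proteins: int) -> float:
    """Percent of a proteome made up of ANK-containing proteins."""
    if total_proteins <= 0:
        raise ValueError("total_proteins must be positive")
    return 100.0 * ank_proteins / total_proteins


def species_averages(strains):
    """Average ANK count and ANK percent over all strains of each species.

    `strains` is an iterable of (species, ank_proteins, total_proteins).
    Strains without a total protein count (None) are skipped, mirroring
    the rule that only strains with total protein information are used.
    """
    by_species = defaultdict(list)
    for species, ank, total in strains:
        if total is None:
            continue
        by_species[species].append((ank, percent_ank(ank, total)))
    return {
        species: (mean(a for a, _ in vals), mean(p for _, p in vals))
        for species, vals in by_species.items()
    }


# Hypothetical strain records (species, ANK proteins, total proteins).
example = [
    ("Species A", 10, 1000),
    ("Species A", 20, 1000),
    ("Species B", 4, 2000),
    ("Species C", 3, None),  # no proteome size available, so excluded
]
averages = species_averages(example)
print(averages)
```

Averaging within species before comparing classes or lifestyles keeps heavily sequenced species, which contribute many strains, from dominating the comparison.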
To analyze the amino acid sequence of ANK repeats and generate the consensus
sequence for Archaea, we obtained the sequence ID of ANK-containing proteins from
SUPERFAMILY v1.75 (SUPERFAMILY) and the amino acid sequence from NCBI's
Proteins database (NCBI Protein resource). We used SMART to identify the number and
location of each individual ANK repeat in the protein
(Letunic, Doerks & Bork, 2012; Schultz et al., 1998). For the amino acid sequence identity
analysis, individual ANK repeat sequences were aligned with MUSCLE using default
parameters (Edgar, 2004) and the percent identity of the sequences was calculated in
Geneious Pro 5.6.2 (Biomatters, 2010). To generate the archaeal ANK consensus
sequence, all 132 ANK repeat sequences from the ANK-containing proteins identified in
the SUPERFAMILY database were utilized. To generate the eukaryotic ANK consensus
sequence, one ANK-containing protein from each phylum was identified in the
SUPERFAMILY database and its ANK repeats were identified by SMART, resulting in a
total of 153 ANK repeat sequences (Table S2). When
comparing ANK repeat sequences from two strains, the average of all combinations of
ANK repeat comparisons was used. For the eukaryotic and archaeal consensus sequences,
all indels and ends were trimmed after the ANK repeats were aligned by MUSCLE. The
consensus sequence was generated by Geneious.
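The "average of all combinations" comparison between two strains can be illustrated with a short sketch. It assumes the repeats have already been aligned to equal length and uses one common definition of percent identity (identical columns over compared columns, ignoring columns that are gaps in both sequences); Geneious may define identity differently, and the toy sequences below are hypothetical.

```python
from itertools import product


def percent_identity(a: str, b: str) -> float:
    """Percent of aligned columns that carry the same residue.

    Columns where both sequences have a gap ('-') are ignored; a gap
    opposite a residue counts as a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


def mean_cross_identity(repeats_a, repeats_b) -> float:
    """Average identity over every pairing of a repeat from strain A
    with a repeat from strain B."""
    scores = [percent_identity(a, b) for a, b in product(repeats_a, repeats_b)]
    return sum(scores) / len(scores)


# Toy aligned "repeats" (four columns each), for illustration only.
strain_a = ["ABCD", "ABCE"]
strain_b = ["ABCD"]
print(mean_cross_identity(strain_a, strain_b))  # (100 + 75) / 2 = 87.5
```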

16S rRNA phylogenetic tree and independence analysis 

We selected one representative 16S rRNA sequence from each bacterial class and aligned
them with MUSCLE in Geneious Pro 5.6.2 (Table S3). This alignment was then used to
reconstruct a phylogenetic tree that reflects the well-established ancestry of the bacterial
classes for a phylogenetic independence test of the abundance of ANK-containing
proteins. Prior to building the tree, a DNA substitution model for the alignment was
selected using jModelTest version 2.1.3 with default parameters (Darriba et al., 2012;
Guindon & Gascuel, 2003). A Bayesian phylogenetic tree was generated by MrBayes using
the HKY85+I+G model of DNA sequence evolution with default parameters (Huelsenbeck
& Ronquist, 2001; Ronquist & Huelsenbeck, 2003; Hasegawa, Kishino & Yano, 1985). For
testing phylogenetic independence of ANK-containing proteins, the PDAP program in
Mesquite v. 2.75 was used to generate independent contrasts for the data in Fig. 3B using
default parameters (Maddison & Maddison, 2006; Midford, Garland & Maddison, 2005).
Phylogenetic Independence version 2.0 (Reeve & Abouheif, 2003) performed the Test For
Serial Independence (TFSI) using default parameters based on the Bayesian tree.

RESULTS 

ANK-containing proteins across the tree of life 

The consensus amino acid sequences for the ANK repeats in each domain of life are 
shown in Fig. 1 (Al-Khodor et al., 2010; Mosavi et al., 2004) (Table S2). There is a notable
correspondence in amino acid identity and similarity across the domains, with the 
highest values between Eukarya and Bacteria (76.7% identity), followed by Archaea and 
Bacteria (73.3% identity), and then Eukarya and Archaea (66% identity). Despite the 
Figure 1 ANK repeat consensus sequence across all domains of life. Comparison of consensus
sequences previously derived from (i) 153 Eukarya ANK repeat sequences (Table S2), (ii) 132 Archaea ANK
repeat sequences and (iii) Bacteria ANK repeat sequences (Al-Khodor et al., 2010). The amino acid color
scheme indicates that the amino acids share similar biochemical properties (polar uncharged, green;
positively charged, light blue; negatively charged, purple; hydrophobic, dark blue; glycine, orange; proline,
yellow). [* This alanine (A) appears in equal proportions (16%) to lysine (K)].



conservation of the domain-specific consensus sequences, there can be substantial amino
acid sequence diversity at each position of the ANK repeat. For example, this variation is
evident in the Archaea, where the mean % of the sequences ± standard deviation that
establishes each consensus amino acid is 49.6 ± 24.7% (Table S4). Indeed, seven amino
acid positions form a consensus from less than one quarter of the sequences.
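To make the "percent of sequences that establishes each consensus amino acid" concrete, the sketch below derives a majority-rule consensus and its per-position support from a set of aligned repeats, then summarizes the support as mean ± standard deviation. It is an assumption-laden illustration: the consensus in this study was generated by Geneious, whose rules may differ, whether a sample or population standard deviation was used is not stated, and the four toy sequences are hypothetical.

```python
from collections import Counter
from statistics import mean, stdev


def consensus_with_support(aligned):
    """Return (consensus string, per-position percent support).

    For each alignment column the most frequent residue is taken as the
    consensus, and its support is the percent of sequences carrying it.
    Ties are resolved in favor of the residue seen first in the column.
    """
    n = len(aligned)
    consensus, support = [], []
    for column in zip(*aligned):
        residue, count = Counter(column).most_common(1)[0]
        consensus.append(residue)
        support.append(100.0 * count / n)
    return "".join(consensus), support


# Toy alignment of four three-residue "repeats", for illustration only.
toy = ["GAT", "GAT", "GCT", "GDA"]
cons, supp = consensus_with_support(toy)
print(cons, supp)                # GAT [100.0, 50.0, 75.0]
print(mean(supp), stdev(supp))   # mean support and sample standard deviation
```

A position with low support, such as the 50% column in the toy example, is the situation described for the seven archaeal positions where fewer than one quarter of the sequences agree with the consensus residue.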

Of the 2,489 strains analyzed here, 1,912 are from the domain Bacteria, 444 are from 
the domain Eukarya, and 133 are from the domain Archaea. All 444 eukaryotic strains 
except one (Saccharomyces cerevisiae CLIB382, which lacks a completely annotated
genome) contain at least one ANK-containing protein (Fig. 2, Table S1). 51% of bacterial
strains (981/1912) and 11% of archaeal strains (15/133) harbor at least one 
ANK-containing protein (Fig. 2A). When strains are grouped into genera, we similarly 
find that 56% of bacterial genera (308/549) and 9% of archaeal genera (6/69) contain 
species that encode at least one protein with an ANK repeat. 
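The percentages quoted above follow directly from the reported counts, as this short check shows. Only the counts stated in the text are used; the variable names are illustrative.

```python
# Counts reported in the text: (with >= 1 ANK-containing protein, total).
counts = {
    "bacterial strains": (981, 1912),
    "archaeal strains": (15, 133),
    "bacterial genera": (308, 549),
    "archaeal genera": (6, 69),
}

# Percent of each group with at least one ANK-containing protein,
# rounded to the nearest whole percent as in the text.
percents = {name: round(100 * hits / total) for name, (hits, total) in counts.items()}
for name, pct in percents.items():
    print(f"{name}: {pct}%")

# The three domain totals should add up to the 2,489 strains analyzed.
domain_totals = {"Bacteria": 1912, "Eukarya": 444, "Archaea": 133}
print(sum(domain_totals.values()))
```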

For those strains with at least one ANK-containing protein, the average number and 
normalized percent of ANK-containing proteins per strain are shown for each major 
domain of life in Fig. 2B and 2C. The differences in the relative fraction of the proteome 
dedicated to proteins with ANK repeats are significant between the domains 
(Mann-Whitney U p < 0.00001).
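For readers unfamiliar with the Mann-Whitney U test used throughout, the following is a minimal pure-Python sketch of the statistic with a two-sided p-value from the normal approximation. It omits tie and continuity corrections, so it is only an illustration and not the statistical software used in this study; the two samples are hypothetical proteome percentages.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U, two-sided p) using the normal approximation.

    No tie or continuity correction is applied, so p is approximate,
    especially for small samples.
    """
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    return u, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical percent-of-proteome values for two groups of strains.
group_low = [0.1, 0.2, 0.3]
group_high = [1.0, 2.0, 3.0, 4.0]
u_stat, p_value = mann_whitney_u(group_low, group_high)
print(u_stat, p_value)
```

A rank-based test is a natural choice here because the number of ANK-containing proteins per strain is strongly skewed, with most strains carrying few and a handful carrying dozens.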

ANK-containing proteins in bacteria 

The percent of bacterial strains that contain multiple ANK-containing proteins rapidly 
declines as the cutoff number of ANK-containing proteins per proteome increases to four 
and higher (Fig. 3A). To glean which phylogenetic groups of bacteria harbor an enriched 
fraction of ANK-containing proteins, 24 bacterial classes spanning 202 bacterial strains 
encoding ≥ four predicted ANK-containing proteins were analyzed.

The class with the highest fraction of species encoding ≥ four ANK-containing proteins was
Sphingobacteria (Fig. 3B and 3C). To our knowledge, this is the first report that this class of
typically free-living bacteria putatively encodes ANK-containing proteins. Interestingly,
many of the classes with a high percentage of ANK-containing proteins in Fig. 3B and 3C 
Figure 2 ANK-containing protein analysis across all domains of life. (A) Bar graph of the average percent of the strains in each domain that
have one or more ANK-containing proteins. The total number of strains analyzed and the number of strains with more than one ANK-containing
protein are listed below the graph. (B) Bar graph of the average number of ANK-containing proteins in strains of each domain. The average number of
ANK-containing proteins in each domain is listed below the graph. Error bars represent standard deviation. (*P < 0.05, **P < 0.000001, Two-tailed
Mann-Whitney U; ANOVA P < 0.000001). (C) Bar graph showing the average percent of the proteome composed of ANK-containing proteins in
each domain. Error bars represent standard deviation. (*P < 0.000001, Two-tailed Mann-Whitney U; ANOVA, P < 0.000001).



cluster with lineages that form symbioses with hosts, including Spirochetes, Chlamydia, 
and various sub-groups of Proteobacteria. As endosymbioses have independently evolved 
across the tree of Bacteria, the taxa are, as expected, scattered across the bacterial tree such 
that the relative abundance of ANK-containing proteins across the 24 classes of Bacteria 
is independent of phylogenetic history (p = 0.32, PI test; Reeve & Abouheif, 2003).

Enrichment of ANK-containing proteins in bacterial symbionts 

To corroborate the enrichment of ANK-containing proteins in symbiotic bacteria, we 
categorized each taxon with four or more ANK-containing proteins into three bacterial 
lifestyles: (i) free-living species that solely replicate outside of host cells, (ii) facultative 
host-associated (intracellular and extracellular) species that can use a host for replication, 
and (iii) obligate intracellular species that replicate strictly within host cells. We assigned 
these three lifestyles following our previous annotations (Newton & Bordenstein, 2011)
and searching the primary literature (Table S5). 

Our comparisons reveal a striking correlation between replication strategy and 
abundance of proteins containing ANK repeats. Both obligate intracellular and facultative 
host-associated bacteria contain, on average, a significantly higher absolute number of
ANK-containing proteins than those that are free-living (Fig. 4A, Mann-Whitney U
p < 0.001, ANOVA p < 0.00003), despite the notable fact that free-living species have
significantly larger proteomes (Fig. 4C, Mann-Whitney U p < 0.01 for all comparisons,
ANOVA p < 0.00001). Facultative host-associated strains have the most expansive
repertoire of ANK-containing proteins based on absolute protein numbers (Fig. 4A 
Figure 3 Analysis of ANK-containing proteins in Bacteria. (A) Bar graph of the percent of bacterial
strains analyzed (y axis) with the specified number of ANK-containing proteins (x axis). The number
above the bars on the graph lists the number of strains with the specified number of ANK-containing 
proteins. (B) Consensus phylogeny of 16S rRNA sequences from one species (randomly selected) in each 
class. (C) Species analysis of bacterial classes that contain four or more ANK-containing protein encoding 
genes (only classes with 5 or more represented species were included in this analysis). The fraction in 
parentheses represents the number of species with four or more ANK-containing proteins in the bacterial 
class over the total number of species in that bacterial class. 



and 4D), likely owing to their dual capacity to interact with eukaryotic host cells as well as 
retain a large genome. Consistent with these findings, a majority of the bacterial strains 
Table 1 Bacterial species with 20 or more ANK-containing proteins in our analysis.
that contained 20 or more ANK-containing proteins are obligate intracellular or
facultative host-associated microbes, while only one is free-living (Table 1). 

After normalizing the dataset by the total number of proteins, the fraction of the
proteome composed of ANK-containing proteins is highest in obligate intracellular species
(Fig. 4B and 4E). The percentage of ANK-containing proteins is inversely related to
proteome size across bacterial lifestyle. In fact, a significant difference in the abundance
of proteins with ANK repeats is broadly evident between the lifestyles (Mann-Whitney U
p < 0.001 for all comparisons, ANOVA p < 0.00001). When considering both the
abundance of proteins with ANK repeats and limited proteome size, obligate intracellular 
bacteria have a remarkably high composition of ANK-containing proteins that not only 
exceeds that of other bacterial lifestyles, but also is comparable to the composition of 
eukaryotes in Fig. 2C. 

Enrichment of repeats in ANK-containing proteins in bacterial 
symbionts 

Obligate intracellular bacteria also harbor significantly more ANK repeats per protein 
(Fig. 5A; Table S6). On average, an obligate intracellular microbe contains 6.1 ANK 
repeats per ANK-containing protein, while free-living and facultative host-associated 
microbes only contain 4.6 and 4.3 ANK repeats, respectively (ANOVA p = 0.012,
pairwise tests between the lifestyles, t-test p < 0.012). As discussed below, these
differences likely affect protein stability. 
Figure 4 Lifestyle analysis of bacterial species with four or more ANK-containing proteins. An average
of the number or percent of ANK-containing proteins for all strains of the same species was used for these
analyses. FL, FHA and O denote free-living, facultative host-associated and obligate intracellular bacteria,
respectively. (A) Bar graph of the average number of ANK-containing proteins in species with four or more
ANK-containing proteins. Error bars represent standard deviation. (*P < 0.001, **P < 0.00001,
Two-tailed Mann-Whitney U; ANOVA, P < 0.00003). (B) Bar graph of the average percent of the proteome
composed of ANK-containing proteins in species with four or more ANK-containing proteins. Error bars
represent standard deviation. (*P < 0.001, **P < 0.0001, ***P < 0.00001, Two-tailed Mann-Whitney U;
ANOVA, P < 0.00001). (C) Bar graph of the average total number of proteins in the proteomes of species
with four or more ANK-containing proteins. Error bars represent standard deviation. (*P < 0.01, **P <
0.00001, ***P < 0.000001, Two-tailed Mann-Whitney U; ANOVA, P < 0.00001). (D) Bar graph of percent
of species in each lifestyle that contain the specified number of ANK-containing proteins (example: 74%
of obligate intracellular species, 58% of facultative host-associated species, and 28% of free-living species
of bacteria contain six ANK-containing proteins). (E) Bar graph of the percent of species in each lifestyle
that contain the specified percent of ANK-containing proteins.
Figure 5 Individual ANK repeat number and amino acid sequence identity analysis. (A) Bar graph of the average number of ANK repeats in
ANK-containing proteins for free-living (FL), facultative host-associated (FHA) and obligate intracellular (O) bacteria. Error bars represent standard
deviation (*p = 0.0127, **p = 0.0036, t-test). For a list of strains analyzed, refer to Table S6. (B) Bar graph of the average percent of amino acid
identity of the ANK repeats from the listed species with Wolbachia wMel ANK repeats. Strains analyzed are listed in Table S8. Error bars represent
standard error.



Effect of symbiont transmission on ANK-containing proteins 

To determine if the mode of transmission of obligate intracellular bacteria associates with 
the abundance of ANK-containing proteins, we employed a previously published list of 
vertically and horizontally transmitted obligate intracellular bacteria (Table S7) (Newton
& Bordenstein, 2011). Based on the mean of all strains from the same species (a species
average), horizontally transmitted taxa (n = 24) contain more ANK-containing proteins
than vertically transmitted ones (n = 6) (5.33 vs. 1.66, Mann-Whitney U p = 0.174), and
have a higher percentage of their proteome dedicated to ANK-containing proteins (0.41%
vs. 0.12%, Mann-Whitney U p = 0.191). However, these differences are not statistically
significant, likely owing to the small sample size in the vertically transmitted group. If we
analyze the data from each strain, the differences between horizontally (n = 31) and
vertically transmitted taxa (n = 8) are marginally non-significant for the abundance of
ANK-containing proteins (5.13 vs. 0.88, Mann-Whitney U p = 0.062) and proportion of
ANK-containing proteins (0.39% vs. 0.11%, Mann-Whitney U p = 0.08).

ANK amino acid sequence identity across bacterial lifestyles 

Two explanations for why obligate intracellular bacteria have a greater fraction of proteins 
with ANK repeats and ANK repeats per ANK-containing protein than facultative 
host-associated and free-living bacteria are: (i) ANK-containing proteins are adaptive to 
bacteria with an intracellular lifestyle or (ii) ANK-containing proteins experience 
frequent horizontal transfer between co-infecting, obligate intracellular microbes. 

Fig. 5B demonstrates that there is no conservation in the ANK repeat amino acid 
sequence between species of the same lifestyle. For instance, when comparing the amino 
acid sequence of Wolbachia (strain wMel) ANK repeats to the ANK repeat sequences
from other obligate intracellular, facultative host-associated and free-living microbes,
there are no significant differences in the amount of sequence identity between lifestyles
(Fig. 5B; Table S8). Surprisingly, Wolbachia ANK repeats are no more or less similar in
sequence to each other than to ANK repeats from other obligate intracellular, facultative
host-associated and free-living species. Even the ANK repeat amino acid sequences of
species in the same order have very little sequence identity (Fig. S1). This low level of
sequence identity within and between unrelated taxa may be due to degeneracy in the
ANK repeat amino acid sequence itself (Li, Mahajan & Tsai, 2006) and does not permit a
demarcation of the two explanations above. 

ANK-containing proteins in archaea 

Of the 133 archaeal strains, 11% contain ANK-containing proteins (Fig. 2). Of these 
strains, the average number of ANK repeats per protein was 5.25, and four species 
contained more than one ANK-containing protein in their proteome (Fig. 6A). 
Interestingly, the ANK-containing proteins in some archaeal genera are conserved, while 
others are not. In the Methanosarcina genus, two species each have one ANK-containing
protein, and these share 66.9% amino acid identity. However, the three species with
ANK-containing proteins from the Pyrobaculum genus have very different amino acid 
sequences (Fig. 6B). Other archaeal genera with ANK-containing proteins include 
Acidianus, Halogeometricum, Metallosphaera, Methanocella, Methanococcus, 
Methanothermococcus, Sulfolobus, Thermofilum, and Thermoplasma (Table S9). 

DISCUSSION 

A central finding of this comparative study is that ANK repeats are more prevalent in 
bacterial species than generally recognized in the current literature, with over half of all of 
the 1,912 bacterial strains analyzed containing ANK-containing proteins. Far from being 
rare or even exclusive to certain phylogenetic groups of related bacteria, ANK repeats in 
Bacteria are widely distributed protein motifs. We do note that this analysis is limited to 
the strain information present in the SUPERFAMILY database (SUPERFAMILY). While
not exhaustive, this database and our analysis span a broad spectrum of bacterial
diversity, including 1,912 bacterial strains representing 992 species and 52 phylogenetic
classes. Since certain strains of Bacteria that have relevance to human health naturally
receive attention and have been well sampled, it is possible that the SUPERFAMILY 
dataset is not representative of the microbial diversity of the natural world, but rather is 
enriched in bacterial species that affect human health. Nonetheless, this analysis is the 
most comprehensive survey of ANK repeat distribution and abundance to date, leading 
us to conclude that previous assumptions about the rarity of ANK repeats outside of 
eukaryotes are exaggerated. 

Evolutionary theories on the origins of the ANK repeat have evolved over time. 
Originally, it was assumed that prokaryotic ANK-containing proteins were obtained via 
horizontal gene transfer (HGT) from eukaryotic hosts, indicating that the ANK repeat 
Figure 6 Analysis of ANK-containing proteins in archaeal strains. (A) Bar graph of the percent of
archaeal strains analyzed with the specified number of ANK-containing proteins. The number above the
bars on the graph lists the number of strains with the specified number of ANK-containing proteins.
(B) Chart of the percent amino acid identity between the amino acid sequences of Pyrobaculum
ANK-containing proteins.



originated in eukaryotic proteins (Bork, 1993). While the short sequence and divergence
levels of the repeat motif between taxa preclude a clear inference of the origin of ANK
repeats, there are several reasons why a single, common ancestor may be just as likely as
horizontal transfer of the ANK repeat between the phylogenetic domains. First, we
showed that the consensus sequences between the three domains are roughly similar, thus 
making it difficult to rule out that ANK repeat evolution follows the phylogeny of the 
domains. Second, there are several species of Archaea and non-host associated microbes 
that have ANK-containing proteins, which may be indicative of an older origin of the 
ANK repeat. Finally, although the results indicate that host-associated microbes have an 
increased fraction of ANK-containing proteins in comparison to free-living microbes, all 
lifestyles can harbor such proteins, suggesting that they provide broader advantages to the
cell. Whether these proteins were acquired by HGT or evolved by descent with
modification from a common ancestor, the distribution of these proteins in Bacteria and
Archaea has been unknown and warrants functional and evolutionary analyses.
While ANK repeats in eukaryotes are ubiquitous structural motifs that facilitate a
myriad of protein-protein interactions, our analysis reveals that ANK repeats cluster to
some degree in symbiotic bacteria involved in microbial-host interactions. Recent studies
of host-associated bacterial species, including Legionella pneumophila (Al-Khodor et al.,
2010; de Felipe et al., 2008), Anaplasma phagocytophilum (IJdo, Carlson & Kennedy, 2007),
and Ehrlichia chaffeensis (Zhu et al., 2009), show that ANK-containing proteins can be
secreted through a type IV secretion system into the cytoplasm of their host and alter host
gene expression and interfere with the host's microtubule-directed vesicular transport,
respectively (Garcia-Garcia et al., 2009; Pan et al., 2008; Zhu et al., 2009). Based on our
data, bacterial ANK-containing proteins may play a significant role in ensuring the
pathogen's survival within the host cell.

Protein folding studies indicate that higher numbers of ANK repeats in a protein
result in increased structural stability (Hagai et al., 2012; Mello et al., 2005; Wetzel et al.,
2008). We observed that obligate intracellular microbes, on average, have 6.1 ANK
repeats per protein, in comparison to 4.6 and 4.55 in bacteria with free-living and
facultative host-associated replication, respectively (Fig. 5A). This significant difference
suggests that the proteins with ANK repeats in obligate intracellular bacteria have a more
stable structure than those from bacteria in the other two lifestyles. Furthermore, a study
on the folding dynamics and stability of DARPins (designed ankyrin repeat proteins)
composed of identical ANK repeats designed from a consensus ANK repeat found that
when the number of ANK repeats was reduced from 7 to 4, the stability of the protein was
substantially reduced (Wetzel et al., 2008). Coincidentally, this difference in the number
of ANK repeats is similar to that observed between obligate intracellular bacteria and
free-living/facultative host-associated lifestyles in our analysis. Taken together, we suggest
that the ANK-containing proteins in obligate intracellular species have, on average, a
more stable structure that could potentially underlie more effective interactions between
bacterial effector proteins and host proteins. Interestingly, recent proteomic evidence has
indicated that some obligate intracellular bacteria, including Blochmannia chromaiodes
and Buchnera, express an abundance of chaperones, such as GroEL, in an effort to provide
greater stability for proteins that have accumulated deleterious mutations (Fan et al., 2013;
Poliakov et al., 2011). It is possible that enhanced stability of the ANK domain conferred
by the accumulation of additional ANK repeats is not required to provide stability for
protein interactions, but is rather part of an overall effort to increase protein stability.

On a related note, a comparative study on ANK domain-encoding genes (ANK genes)
present in species of Wolbachia pipientis that inhabit Drosophila found that these ANK
genes are rapidly evolving due to homologous and illegitimate recombination via the
short direct repeat sequences (Siozios et al., 2013). The authors speculated that since
stress-related genes also contain these types of direct repeats, which allow for rapid
change in challenging environmental conditions, ANK-containing proteins may be used
in similar stressful conditions such as directly interacting with host tissues or proteins
(Rocha, Matic & Taddei, 2002; Siozios et al., 2013). This inference complements the
findings of our analysis because the enriched repertoire of ANK-containing proteins and
ANK repeats per protein in obligate bacteria may aid intimate host-microbe interactions.

A number of pathogenic microbes that contain ANK-containing proteins have been
identified in this study. For instance, the microbe with the greatest number is the
spirochete Brachyspira hyodysenteriae, which remarkably has 60 ANK-containing
proteins. B. hyodysenteriae is a classic gastrointestinal pathogen and the causative agent of
a wide range of diarrheal diseases in pigs that naturally leads to significant economic
ramifications (ter Huurne & Gaastra, 1995). Of B. hyodysenteriae's 60 ANK-containing
proteins, 34 contain a signal sequence for secretion (Table S10), suggesting that many of
these proteins, if expressed, are exported from the microbe into its host, where they may facilitate
pathogenesis (Bellgard et al., 2009; Mappley et al., 2012).

The number of ANK-containing proteins within a group of closely related taxa can be 
extremely variable. In the order Campylobacterales, Helicobacter hepaticus has 13 such 
proteins, Helicobacter mustelae has two proteins and Helicobacter cinaedi has three. The 
remaining five Helicobacter species in our analysis do not have any ANK-containing 
proteins (Table S11). The related Campylobacter species, including Campylobacter jejuni,
have two to three (Table S11), and some ANK-containing proteins in Helicobacter and
Campylobacter are probable orthologs (Fig. S2). Interestingly, one ANK-containing
protein present in both H. cinaedi and C. jejuni is required for C. jejuni colonization due
to its capacity to reduce levels of reactive oxygen species (ROS) in the cell (Flint, Sun &
Stintzi, 2012). Finally, the increased repertoire of ANK-containing proteins in H.
hepaticus, particularly the three proteins with a secretion signal sequence and the two
proteins with transmembrane domains (Table S12), may be associated with this species'
unique infection of the lower bowel and liver of its host, resulting in inflammatory bowel
disease, chronic hepatitis, and liver cancer (Suerbaum et al., 2003).
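
As a rough illustration of this within-genus variability, the sketch below tallies the Helicobacter counts quoted above. It is not part of the original analysis: the function and variable names are hypothetical, and because the five species lacking ANK-containing proteins are not named in the text, placeholder labels are used for them.

```python
from statistics import mean

# ANK-containing protein counts per Helicobacter species, as quoted in the text.
# The five species with zero such proteins are not named there, so the labels
# below are placeholders (an assumption made only for this illustration).
helicobacter_counts = {
    "Helicobacter hepaticus": 13,
    "Helicobacter mustelae": 2,
    "Helicobacter cinaedi": 3,
    **{f"Helicobacter sp. (unnamed {i})": 0 for i in range(1, 6)},
}


def summarize(counts):
    """Return simple descriptive statistics for per-species protein counts."""
    values = list(counts.values())
    return {
        "species": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        # Share of species with no ANK-containing proteins at all.
        "fraction_without_ank": sum(v == 0 for v in values) / len(values),
    }


summary = summarize(helicobacter_counts)
print(summary)
```

For these eight species the counts span 0 to 13, and more than half of the species have none, which is the spread the paragraph describes.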

Although the vast majority of the species with the highest number of ANK-containing
proteins are host associated, Desulfomonile tiedjei is an outlier because it harbors 42 such
proteins (Table 1). D. tiedjei is an anaerobic, free-living bacterium that dechlorinates
chlorinated hydrocarbons, such as tetrachloroethylene (PCE) and trichloroethylene (TCE) (Deweerd
& Suflita, 1990). The fact that D. tiedjei harbors 42 ANK-containing proteins, 19 of
which also contain signal sequences, has, to our knowledge, not been reported or
discussed in relation to this microbe's bioremediation capabilities (Table S13). Although it
dechlorinates PCE and TCE, D. tiedjei cannot use these chemicals as a carbon source.
Instead, D. tiedjei lives syntrophically with other anaerobic microbes and relies on them
for nutrients (Shelton & Tiedje, 1984). We speculate, based on the widespread enrichment of
ANK-containing proteins in symbionts, that these proteins could play a
role in this interaction.

CONCLUSIONS 

Our analysis of the ANK protein motif, augmented with taxon lifestyles and
phylogeny, expands the known breadth of ANK repeat biology across the diversity of life. The
enrichment of ANK-containing proteins in host-associated bacteria signifies that they are
not evolutionarily restricted to unique types of Bacteria or Archaea, but instead can
independently thrive in diverse taxa. The functional roles of ANK-containing proteins in
Bacteria and Archaea remain understudied and will be an exciting frontier for future
investigations of protein interactions between the different domains of life.



ADDITIONAL INFORMATION AND DECLARATIONS 



Funding 

This research was made possible by NIH awards F32 GM100778 and 5T32HD007043-34
to KKJ, and R01 GM085163 to S.R.B. The funders had no role in study design, data
collection and analysis, decision to publish, or preparation of the manuscript. 

Grant Disclosures 

The following grant information was disclosed by the authors: 
NIH awards: F32 GM100778
5T32HD007043-34
R01 GM085163.

Competing Interests 

The authors declare that they have no competing interests. 

Author Contributions 

• Kristin Jernigan conceived and designed the experiments, performed the experiments, 
analyzed the data, wrote the paper. 

• Seth R Bordenstein conceived and designed the experiments, analyzed the data, wrote 
the paper. 

Supplemental Information 

Supplemental information for this article can be found online at 
http://dx.doi.org/10.7717/peerj.264. 

REFERENCES 

Al-Khodor S, Price CT, Kalia A, Abu Kwaik Y. 2010. Functional diversity of ankyrin repeats in
microbial proteins. Trends in Microbiology 18:132-139 DOI 10.1016/j.tim.2009.11.004.
Bellgard MI, Wanchanthuek P, La T, Ryan K, Moolhuijzen P, Albertyn Z, Shaban B, Motro Y,
Dunn DS, Schibeci D, Hunter A, Barrero R, Phillips ND, Hampson DJ. 2009. Genome
sequence of the pathogenic intestinal spirochete Brachyspira hyodysenteriae reveals adaptations
to its lifestyle in the porcine large intestine. PLoS ONE 4:e4641
DOI 10.1371/journal.pone.0004641.
Biomatters. 2010. Geneious version 5.6.2. Available from http://www.geneious.com/.
Bork P. 1993. Hundreds of ankyrin-like repeats in functionally diverse proteins: mobile modules
that cross phyla horizontally? Proteins 17:363-374 DOI 10.1002/prot.340170405.
Breeden L, Nasmyth K. 1987. Similarity between cell-cycle genes of budding yeast and fission
yeast and the Notch gene of Drosophila. Nature 329:651-654 DOI 10.1038/329651a0.
Darriba D, Taboada GL, Doallo R, Posada D. 2012. jModelTest 2: more models, new heuristics
and parallel computing. Nature Methods 9:772 DOI 10.1038/nmeth.2109.



Jernigan et al. (2014), PeerJ, 10.7717/peerj.264






de Felipe KS, Glover RT, Charpentier X, Anderson OR, Reyes M, Pericone CD, Shuman HA.
2008. Legionella eukaryotic-like type IV substrates interfere with organelle trafficking. PLoS
Pathogens 4:e1000117 DOI 10.1371/journal.ppat.1000117.
Deweerd KA, Suflita JM. 1990. Anaerobic aryl reductive dehalogenation of halobenzoates by cell
extracts of "Desulfomonile tiedjei". Applied and Environmental Microbiology 56:2999-3005.
Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high
throughput. Nucleic Acids Research 32:1792-1797 DOI 10.1093/nar/gkh340.
Fan Y, Thompson JW, Dubois LG, Moseley MA, Wernegreen JJ. 2013. Proteomic analysis of an
unculturable bacterial endosymbiont (Blochmannia) reveals high abundance of chaperonins
and biosynthetic enzymes. Journal of Proteome Research 12:704-718 DOI 10.1021/pr3007842.
Flint A, Sun YQ, Stintzi A. 2012. Cj1386 is an ankyrin-containing protein involved in heme
trafficking to catalase in Campylobacter jejuni. Journal of Bacteriology 194:334-345
DOI 10.1128/JB.05740-11.
Garcia-Garcia JC, Barat NC, Trembley SJ, Dumler JS. 2009. Epigenetic silencing of host cell
defense genes enhances intracellular survival of the rickettsial pathogen Anaplasma
phagocytophilum. PLoS Pathogens 5:e1000488 DOI 10.1371/journal.ppat.1000488.
Gorina S, Pavletich NP. 1996. Structure of the p53 tumor suppressor bound to the ankyrin and
SH3 domains of 53BP2. Science 274:1001-1005 DOI 10.1126/science.274.5289.1001.
Gough J, Karplus K, Hughey R, Chothia C. 2001. Assignment of homology to genome sequences
using a library of hidden Markov models that represent all proteins of known structure. Journal
of Molecular Biology 313:903-919 DOI 10.1006/jmbi.2001.5080.
Guindon S, Gascuel O. 2003. A simple, fast, and accurate algorithm to estimate large phylogenies
by maximum likelihood. Systematic Biology 52:696-704 DOI 10.1080/10635150390235520.
Hagai T, Azia A, Trizac E, Levy Y. 2012. Modulation of folding kinetics of repeat proteins:
interplay between intra- and interdomain interactions. Biophysical Journal 103:1555-1565
DOI 10.1016/j.bpj.2012.08.018.
Hasegawa M, Kishino H, Yano T. 1985. Dating of the human-ape splitting by a molecular clock of
mitochondrial DNA. Journal of Molecular Evolution 22:160-174 DOI 10.1007/BF02101694.
Huelsenbeck JP, Ronquist F. 2001. MRBAYES: Bayesian inference of phylogenetic trees.
Bioinformatics 17:754-755 DOI 10.1093/bioinformatics/17.8.754.
Inada H, Procko E, Sotomayor M, Gaudet R. 2012. Structural and biochemical consequences of
disease-causing mutations in the ankyrin repeat domain of the human TRPV4 channel.
Biochemistry 51:6195-6206 DOI 10.1021/bi300279b.
Joutel A, Corpechot C, Ducros A, Vahedi K, Chabriat H, Mouton P, Alamowitch S, Domenga
V, Cecillion M, Marechal E, Maciazek J, Vayssiere C, Cruaud C, Cabanis EA, Ruchoux MM,
Weissenbach J, Bach JF, Bousser MG, Tournier-Lasserve E. 1996. Notch3 mutations in
CADASIL, a hereditary adult-onset condition causing stroke and dementia. Nature
383:707-710 DOI 10.1038/383707a0.
IJdo JW, Carlson AC, Kennedy EL. 2007. Anaplasma phagocytophilum AnkA is
tyrosine-phosphorylated at EPIYA motifs and recruits SHP-1 during early infection. Cellular
Microbiology 9:1284-1296 DOI 10.1111/j.1462-5822.2006.00871.x.
Letunic I, Doerks T, Bork P. 2012. SMART 7: recent updates to the protein domain annotation
resource. Nucleic Acids Research 40(D1):D302-D305 DOI 10.1093/nar/gkr931.
Li J, Mahajan A, Tsai MD. 2006. Ankyrin repeat: a unique motif mediating protein-protein
interactions. Biochemistry 45:15168-15178 DOI 10.1021/bi062188q.






Maddison WP, Maddison DR. 2006. Mesquite: A modular system for evolutionary analysis.
Version 1.1.
Mappley LJ, Black ML, AbuOun M, Darby AC, Woodward MJ, Parkhill J, Turner AK, Bellgard
MI, La T, Phillips ND, La Ragione RM, Hampson DJ. 2012. Comparative genomics of
Brachyspira pilosicoli strains: genome rearrangements, reductions and correlation of genetic
compliment with phenotypic diversity. BMC Genomics 13:454 DOI 10.1186/1471-2164-13-454.
Mello CC, Bradley CM, Tripp KW, Barrick D. 2005. Experimental characterization of the
folding kinetics of the notch ankyrin domain. Journal of Molecular Biology 352:266-281
DOI 10.1016/j.jmb.2005.07.026.
Midford PE, Garland T Jr, Maddison WP. 2005. PDAP Package of Mesquite. Version 1.07.
Mosavi LK, Cammett TJ, Desrosiers DC, Peng ZY. 2004. The ankyrin repeat as molecular
architecture for protein recognition. Protein Science 13:1435-1448 DOI 10.1110/ps.03554604.
NCBI Genome resource. Available at: http://www.ncbi.nlm.nih.gov/genome.
NCBI Protein resource. Available at: http://www.ncbi.nlm.nih.gov/protein.
Newton IL, Bordenstein SR. 2011. Correlations between bacterial ecology and mobile DNA.
Current Microbiology 62:198-208 DOI 10.1007/s00284-010-9693-3.
Pan X, Luhrmann A, Satoh A, Laskowski-Arce MA, Roy CR. 2008. Ankyrin repeat proteins
comprise a diverse family of bacterial type IV effectors. Science 320:1651-1654
DOI 10.1126/science.1158160.
Penz T, Schmitz-Esser S, Kelly SE, Cass BN, Muller A, Woyke T, Malfatti SA, Hunter MS, Horn
M. 2012. Comparative genomics suggests an independent origin of cytoplasmic incompatibility
in Cardinium hertigii. PLoS Genetics 8:e1003012 DOI 10.1371/journal.pgen.1003012.
Poliakov A, Russell CW, Ponnala L, Hoops HJ, Sun Q, Douglas AE, van Wijk KJ. 2011.
Large-scale label-free quantitative proteomics of the pea aphid-Buchnera symbiosis. Molecular
& Cellular Proteomics 10:M110.007039.
Ponce G, Hoenicka J, Jimenez-Arriero MA, Rodriguez-Jimenez R, Aragues M, Martin-Sune N,
Huertas E, Palomo T. 2008. DRD2 and ANKK1 genotype in alcohol-dependent patients with
psychopathic traits: association and interaction study. The British Journal of Psychiatry
193:121-125.
Reeve J, Abouheif E. 2003. Phylogenetic Independence. Version 2.0, Department of Biology,
McGill University.
Rocha EP, Matic I, Taddei F. 2002. Over-representation of repeats in stress response genes: a
strategy to increase versatility under stressful conditions? Nucleic Acids Research 30:1886-1894.
Ronquist F, Huelsenbeck JP. 2003. MrBayes 3: Bayesian phylogenetic inference under mixed
models. Bioinformatics 19:1572-1574.
Schultz J, Milpetz F, Bork P, Ponting CP. 1998. SMART, a simple modular architecture research
tool: identification of signaling domains. Proceedings of National Academy of Sciences of United
States of America 95:5857-5864.
Sedgwick SG, Smerdon SJ. 1999. The ankyrin repeat: a diversity of interactions on a common
structural framework. Trends in Biochemical Sciences 24:311-316.
Shelton DR, Tiedje JM. 1984. General method for determining anaerobic biodegradation
potential. Applied and Environmental Microbiology 47:850-857.
Siozios S, Ioannidis P, Klasson L, Andersson SG, Braig HR, Bourtzis K. 2013. The diversity and
evolution of Wolbachia ankyrin repeat domain genes. PLoS ONE 8:e55390.






Suerbaum S, Josenhans C, Sterzenbach T, Drescher B, Brandt P, Bell M, Droge M, Fartmann
B, Fischer HP, Ge Z, Horster A, Holland R, Klein K, Konig J, Macko L, Mendz GL,
Nyakatura G, Schauer DB, Shen Z, Weber J, Frosch M, Fox JG. 2003. The complete genome
sequence of the carcinogenic bacterium Helicobacter hepaticus. Proceedings of National
Academy of Sciences of United States of America 100:7901-7906.
SUPERFAMILY. HMM library and genome assignments server. Available at
http://supfam.org/SUPERFAMILY/.
Suraj Singh H, Ghosh PK, Saraswathy KN. 2013. DRD2 and ANKK1 Gene polymorphisms and
alcohol dependence: a case-control study among a mendelian population of east asian ancestry.
Alcohol and Alcoholism 48:409-414.
Tang KS, Fersht AR, Itzhaki LS. 2003. Sequential unfolding of ankyrin repeats in tumor
suppressor p16. Structure 11:67-73.
ter Huurne AA, Gaastra W. 1995. Swine dysentery: more unknown than known. Veterinary
Microbiology 46:347-360.
Wetzel SK, Settanni G, Kenig M, Binz HK, Pluckthun A. 2008. Folding and unfolding
mechanism of highly stable full-consensus ankyrin repeat proteins. Journal of Molecular Biology
376:241-257.
Wilson D, Pethica R, Zhou Y, Talbot C, Vogel C, Madera M, Chothia C, Gough J. 2009.
SUPERFAMILY-sophisticated comparative genomics, data mining, visualization and
phylogeny. Nucleic Acids Research 37:D380-386.
Zhu B, Nethery KA, Kuriakose JA, Wakeel A, Zhang X, McBride JW. 2009. Nuclear translocated
Ehrlichia chaffeensis ankyrin protein interacts with a specific adenine-rich motif of host
promoter and intronic Alu elements. Infection and Immunity 77:4243-4255.






